Data Exploration Notebook¶

This notebook explores the white wine quality dataset to gain insights for future model building. It also works through steps 1 to 8 of the Supervised ML flow chart.

Import Packages¶

In [1]:
import sys
import time
from pathlib import Path
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE

Set up for imports of .py modules¶

In [2]:
path = Path(os.getcwd())
path = str(path)
print(path)
sys.path.insert(1, path)
/Users/evangelinekim/Developer/Projects/Roux_Institute/DS5220/DS5220-Supervised-ML-Project

Import Python Modules¶

In [3]:
import utils.sml_utils as sml_utils

Parameters¶

In [4]:
path_to_data = 'data/winequality-white.csv'
target_attr = 'quality'
test_size = 0.20
train_test_split_random_state = 42
missingness_threshold = 0.20

Set up time¶

In [5]:
start = time.time()

Reading in data¶

In [6]:
df = pd.read_csv(path_to_data, sep=";")
print(df.shape)
df.head()
(4898, 12)
Out[6]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.0 0.27 0.36 20.7 0.045 45.0 170.0 1.0010 3.00 0.45 8.8 6
1 6.3 0.30 0.34 1.6 0.049 14.0 132.0 0.9940 3.30 0.49 9.5 6
2 8.1 0.28 0.40 6.9 0.050 30.0 97.0 0.9951 3.26 0.44 10.1 6
3 7.2 0.23 0.32 8.5 0.058 47.0 186.0 0.9956 3.19 0.40 9.9 6
4 7.2 0.23 0.32 8.5 0.058 47.0 186.0 0.9956 3.19 0.40 9.9 6

1. Missingness check¶

The missingness check was completed in phase 1. The cell below only confirms that no rows are missing the target attribute.

In [7]:
print(df.shape)
df = df.dropna(subset=target_attr)
print(df.shape)
(4898, 12)
(4898, 12)

2. Train/Test Data Split¶

wine_train_df and wine_test_df were already created in phase 1. Here, the training set is loaded from disk for exploration and separated into attributes and target.

In [8]:
train_df = pd.read_csv('data/wine_train_df.csv')  # Loads a fresh DataFrame; the CSV on disk is not modified
train_cap_x_df = train_df.iloc[:, :-1]  # All columns except the last one (the attributes)
train_y_df = train_df.iloc[:, -1].to_frame()  # Last column (the target, 'quality') as a one-column DataFrame
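
The phase 1 split itself is not shown in this notebook. As a minimal sketch, assuming it used scikit-learn's `train_test_split` with the `test_size` and `train_test_split_random_state` values from the Parameters cell, the row counts work out as follows. The frame `toy_df` is synthetic and only mirrors the row count of the raw data (4898).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the raw wine data.
# Only the row count matters for this illustration.
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({
    "alcohol": rng.normal(10.5, 1.2, 4898),
    "quality": rng.integers(3, 10, 4898),
})

# Same parameter values as the Parameters cell (test_size=0.20, random_state=42).
toy_train_df, toy_test_df = train_test_split(toy_df, test_size=0.20, random_state=42)

print(toy_train_df.shape, toy_test_df.shape)
```

The training row count from this sketch matches the 3918 entries reported by `train_cap_x_df.info()`.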

3. Train/Validation Split¶

This step is omitted because cross-validation will be used in later steps.
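
For context, a minimal sketch of what k-fold cross-validation looks like in place of a fixed validation split. The data here is synthetic, and the choice of 5 folds, `LinearRegression`, and R² scoring are assumptions for illustration; the later notebooks may use different settings.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (not the wine data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Each of the 5 folds serves once as the validation set,
# so no separate train/validation split is needed.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")

print(scores)
print(scores.mean())
```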

4. Checking Attribute Types¶

In [9]:
train_cap_x_df.dtypes
Out[9]:
fixed acidity           float64
volatile acidity        float64
citric acid             float64
residual sugar          float64
chlorides               float64
free sulfur dioxide     float64
total sulfur dioxide    float64
density                 float64
pH                      float64
sulphates               float64
alcohol                 float64
dtype: object
In [10]:
train_cap_x_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3918 entries, 0 to 3917
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         3918 non-null   float64
 1   volatile acidity      3918 non-null   float64
 2   citric acid           3918 non-null   float64
 3   residual sugar        3918 non-null   float64
 4   chlorides             3918 non-null   float64
 5   free sulfur dioxide   3918 non-null   float64
 6   total sulfur dioxide  3918 non-null   float64
 7   density               3918 non-null   float64
 8   pH                    3918 non-null   float64
 9   sulphates             3918 non-null   float64
 10  alcohol               3918 non-null   float64
dtypes: float64(11)
memory usage: 336.8 KB
In [11]:
train_cap_x_df.describe()
Out[11]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol
count 3918.000000 3918.000000 3918.000000 3918.000000 3918.000000 3918.000000 3918.000000 3918.000000 3918.000000 3918.000000 3918.000000
mean 6.865046 0.279338 0.332731 6.450702 0.045734 35.094564 138.001149 0.994071 3.189293 0.489781 10.508840
std 0.844483 0.101606 0.119758 5.139311 0.021797 16.676958 42.067667 0.003022 0.150183 0.113590 1.227887
min 3.800000 0.080000 0.000000 0.600000 0.009000 3.000000 10.000000 0.987110 2.720000 0.220000 8.000000
25% 6.300000 0.210000 0.270000 1.700000 0.036000 23.000000 108.000000 0.991740 3.090000 0.410000 9.500000
50% 6.800000 0.260000 0.320000 5.200000 0.043000 33.000000 134.000000 0.993800 3.180000 0.470000 10.400000
75% 7.300000 0.330000 0.380000 10.000000 0.050000 46.000000 167.000000 0.996200 3.280000 0.550000 11.400000
max 11.800000 1.100000 1.660000 65.800000 0.346000 146.500000 313.000000 1.038980 3.820000 1.080000 14.200000

Examining some feature selection configurations (variance threshold, univariate feature selection, and recursive feature elimination)¶

In [12]:
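# Fit VarianceThreshold only to read the per-attribute variances; nothing is dropped here.
# With threshold=0.01, a later transform() would remove attributes whose variance does not exceed 0.01.
# Variances are computed on the raw, unscaled attributes, so they depend on each attribute's units.
variance_threshold = VarianceThreshold(0.01)
variance_threshold.fit(train_cap_x_df)
variances = variance_threshold.variances_
variance_scores = pd.DataFrame({'Attribute': train_cap_x_df.columns, 'Variance': variances}).sort_values(by='Variance', ascending=False)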
print("ATTRIBUTE VARIANCE SCORES:")
print(variance_scores)
ATTRIBUTE VARIANCE SCORES:
               Attribute     Variance
6   total sulfur dioxide  1769.236918
5    free sulfur dioxide   278.049953
3         residual sugar    26.405772
10               alcohol     1.507321
0          fixed acidity     0.712970
8                     pH     0.022549
2            citric acid     0.014338
9              sulphates     0.012899
1       volatile acidity     0.010321
4              chlorides     0.000475
7                density     0.000009
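
The variances above are computed on raw attribute values, so they largely reflect units: total sulfur dioxide spans hundreds while density varies in the third decimal place. A minimal sketch of how rescaling changes the picture, using only the five rows shown in the `df.head()` output for these two attributes. The frame `toy_df` and the choice of min-max scaling are assumptions for illustration, not part of the notebook's pipeline.

```python
import pandas as pd

# First five rows of two attributes, copied from the df.head() output above.
toy_df = pd.DataFrame({
    "total sulfur dioxide": [170.0, 132.0, 97.0, 186.0, 186.0],
    "density": [1.0010, 0.9940, 0.9951, 0.9956, 0.9956],
})

# Population variance (ddof=0), matching what VarianceThreshold computes.
raw_var = toy_df.var(ddof=0)

# Min-max scale each attribute to [0, 1] before taking the variance.
scaled_df = (toy_df - toy_df.min()) / (toy_df.max() - toy_df.min())
scaled_var = scaled_df.var(ddof=0)

print(raw_var)
print(scaled_var)
```

On the raw values the two variances differ by many orders of magnitude; after scaling they are of similar size, which suggests a fixed threshold such as 0.01 is easier to interpret on scaled attributes.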
In [13]:
# Plotting the variance scores
plt.figure(figsize=(12, 6))
sns.barplot(x='Variance', y='Attribute', data=variance_scores)
plt.title('Attribute Variance Scores', fontsize=14)
plt.xlabel('Variance (logarithmic scale)', fontsize=12)
plt.ylabel('Attribute', fontsize=12)
plt.xscale('log')  # Log scale keeps the small variances visible next to the large ones
plt.show()
[Figure: Attribute Variance Scores (bar plot, logarithmic x-axis)]
In [14]:
# Univariate feature selection scores. f_regression scores each attribute by the F-statistic
# of its linear relationship with the target, so it only captures linear dependence.
# k='all' keeps every attribute because this step is exploratory.
selector = SelectKBest(score_func=f_regression, k='all')
selector.fit(train_cap_x_df, train_y_df['quality'])
scores = selector.scores_
feature_scores = pd.DataFrame({'Attribute': train_cap_x_df.columns, 'Score': scores}).sort_values(by='Score', ascending=False)
print("ATTRIBUTE SCORES VIA UNIVARIATE FEATURE SELECTION:")
print(feature_scores)
ATTRIBUTE SCORES VIA UNIVARIATE FEATURE SELECTION:
               Attribute       Score
10               alcohol  896.871903
7                density  388.880430
1       volatile acidity  170.490487
4              chlorides  161.756748
6   total sulfur dioxide  106.233764
0          fixed acidity   55.290669
8                     pH   42.123493
3         residual sugar   37.082675
9              sulphates   15.115154
5    free sulfur dioxide    3.253095
2            citric acid    0.770059
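
The scores above are F-statistics. For a single attribute with Pearson correlation r to the target and n samples, `f_regression` (with its default centering) computes F = r² / (1 − r²) · (n − 2), so the ranking is the same as ranking by absolute correlation. A minimal sketch on synthetic data that checks this relationship; the data and variable names are illustrative only.

```python
import numpy as np
from sklearn.feature_selection import f_regression

# Synthetic data: feature 0 is strongly related to y, feature 1 weakly, feature 2 not at all.
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)

f_scores, p_values = f_regression(X, y)

# Recompute the F-statistic from the Pearson correlation of each feature with y.
r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
f_from_r = r**2 / (1 - r**2) * (n - 2)

print(f_scores)
print(f_from_r)
```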
In [15]:
# Plotting the feature scores
plt.figure(figsize=(12, 6))
sns.barplot(x='Score', y='Attribute', data=feature_scores)
plt.title('Univariate Feature Selection Scores', fontsize=14)
plt.xlabel('Score', fontsize=12)
plt.ylabel('Attribute', fontsize=12)
plt.xlim(0, feature_scores['Score'].max() + 50)  # Adjusting the x-axis limit for better readability
plt.show()
[Figure: Univariate Feature Selection Scores (bar plot)]
In [16]:
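# Recursive Feature Elimination (RFE) ranks.
# Caveats: no cross-validation is used, and with LinearRegression the ranking only reflects linear relationships.
# RFE ranks attributes by coefficient magnitude, and the attributes are not scaled here, so the ranks depend on
# each attribute's units. The result is model dependent and serves only as rough observational support for later decisions.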
model = LinearRegression()
rfe = RFE(estimator=model, n_features_to_select=1)
rfe.fit(train_cap_x_df, train_y_df['quality'])
ranking = rfe.ranking_
feature_ranking = pd.DataFrame({'Rank': ranking, 'Attribute': train_cap_x_df.columns}).sort_values(by='Rank').to_string(index=False)
print("ATTRIBUTE RANKS VIA RECURSIVE FEATURE ELIMINATION:")
print(feature_ranking)
ATTRIBUTE RANKS VIA RECURSIVE FEATURE ELIMINATION:
 Rank            Attribute
    1              density
    2            chlorides
    3     volatile acidity
    4            sulphates
    5              alcohol
    6                   pH
    7       residual sugar
    8        fixed acidity
    9          citric acid
   10  free sulfur dioxide
   11 total sulfur dioxide
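
In the ranking above, density (the attribute with the smallest variance) comes first and total sulfur dioxide (the largest variance) comes last. That pattern is consistent with, though it does not prove, a scale effect: RFE with `LinearRegression` eliminates the attribute with the smallest coefficient magnitude, and coefficients on unscaled attributes depend on units. A minimal sketch on synthetic data showing how standardizing can reverse an RFE ranking. The feature names, coefficients, and helper `rfe_ranks` are hypothetical and chosen only to make the effect visible.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 1000
toy_X = pd.DataFrame({
    "wide_strong": rng.normal(0.0, 100.0, n),   # large scale, drives most of the target
    "narrow_weak": rng.normal(0.0, 0.001, n),   # tiny scale, contributes very little
})
toy_y = 0.05 * toy_X["wide_strong"] + 50.0 * toy_X["narrow_weak"] + rng.normal(0.0, 0.1, n)


def rfe_ranks(features, target):
    """Rank features with RFE on a plain linear regression (rank 1 = kept longest)."""
    rfe = RFE(estimator=LinearRegression(), n_features_to_select=1)
    rfe.fit(features, target)
    return rfe.ranking_


raw_rank = rfe_ranks(toy_X.to_numpy(), toy_y.to_numpy())
scaled_rank = rfe_ranks(StandardScaler().fit_transform(toy_X), toy_y.to_numpy())

print(dict(zip(toy_X.columns, raw_rank)))     # ranking on raw units
print(dict(zip(toy_X.columns, scaled_rank)))  # ranking after standardization
```

On raw units the tiny-scale feature wins because its coefficient is numerically large; after standardization the feature that actually explains the target ranks first.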

More Visualizations¶

In [17]:
## Histograms for all the attributes
# DataFrame.hist creates its own figure, so no separate plt.figure() call is needed
axes = train_cap_x_df.hist(bins=30, figsize=(15, 10), grid=False)
plt.suptitle("Attribute Histograms", fontsize=14)
for ax in axes.flatten():
    ax.set_xlabel("attribute values")
    ax.set_ylabel("frequency")
plt.tight_layout()
plt.show()
[Figure: Attribute Histograms]
In [18]:
## Box plots for all the attributes
plt.figure(figsize=(12, 8))
for index, column in enumerate(train_cap_x_df.columns, 1):
    plt.subplot(3, 4, index)
    sns.boxplot(y=train_cap_x_df[column])
    plt.title(column)
plt.suptitle("Attribute Boxplots", fontsize=14)
plt.tight_layout()
[Figure: Attribute Boxplots]
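
Boxplots show outliers visually; counting them makes the comparison across attributes concrete. A minimal sketch, assuming the same 1.5 × IQR whisker rule that seaborn's `boxplot` uses by default. The helper `iqr_outlier_counts` and the toy frame are hypothetical and not part of `sml_utils`.

```python
import pandas as pd


def iqr_outlier_counts(df: pd.DataFrame, whisker: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - whisker*IQR, Q3 + whisker*IQR] for each column."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    outside = (df < q1 - whisker * iqr) | (df > q3 + whisker * iqr)
    return outside.sum()


# Toy frame (not the wine data): column "x" has one extreme value.
toy_df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 100.0],
    "y": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})
counts = iqr_outlier_counts(toy_df)
print(counts)
```

Applied to the notebook's data, the call would be `iqr_outlier_counts(train_cap_x_df)`.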
In [19]:
## Attribute correlation matrix
corr_matrix_attr = train_cap_x_df.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix_attr, annot=True, fmt='.2f', cmap='coolwarm', square=True)
plt.title("Correlation Matrix of Attributes", fontsize=14)
plt.show()
[Figure: Correlation Matrix of Attributes (heatmap)]
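
A heatmap of 11 attributes has 55 distinct pairs, so it can help to list the strongly correlated pairs explicitly. A minimal sketch on synthetic data that keeps only the upper triangle of the correlation matrix and filters by absolute value. The 0.7 cutoff and the toy columns are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data: "b" is almost the negative of "a"; "c" is independent noise.
rng = np.random.default_rng(0)
base = rng.normal(size=300)
toy_df = pd.DataFrame({
    "a": base,
    "b": -base + rng.normal(scale=0.1, size=300),
    "c": rng.normal(size=300),
})

corr = toy_df.corr()

# Keep each pair once by masking the diagonal and the lower triangle.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()

# NaN entries (if any remain) compare as False and are filtered out here.
strong_pairs = pairs[pairs.abs() > 0.7]
print(strong_pairs)
```

With the notebook's data, `corr` would be replaced by `corr_matrix_attr`.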
In [20]:
target_correlation = train_cap_x_df.corrwith(train_y_df['quality'])
target_correlation_fixed = target_correlation.values.reshape(1, -1) # Reshaping for heatmap
plt.figure(figsize=(10, 2))
sns.heatmap(target_correlation_fixed, annot=True, fmt='.2f', cmap='coolwarm', 
            yticklabels=['Quality'], xticklabels=target_correlation.index)
plt.title("Correlation of Attributes with Target", fontsize=14)
plt.xlabel("Attributes")
plt.show()
[Figure: Correlation of Attributes with Target (heatmap)]
In [21]:
# Attribute-Target Scatter plots
plt.figure(figsize=(12, 8))
for i, col in enumerate(train_cap_x_df, 1):
    plt.subplot(3, 4, i)
    sns.scatterplot(y=train_df[col], x=train_df['quality'], alpha=0.5)
    plt.xticks(range(1, 11))
    plt.title(f"{col}-quality")
plt.suptitle("Attribute-Target Scatter Plots", fontsize=14)
plt.tight_layout()
[Figure: Attribute-Target Scatter Plots]
In [22]:
## Attribute-Target Pairplot
sns.pairplot(train_df, hue='quality')
plt.suptitle("Target-Attribute Pairplot", y=1.02, fontsize=14)
plt.show()
[Figure: Target-Attribute Pairplot]

5. Handle Missing Values¶

In [23]:
train_df.isna().sum() # There are no missing values in the data set.
Out[23]:
fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64
In [24]:
return_dict = sml_utils.get_missingness(train_cap_x_df, missingness_threshold)
missingness_drop_list = return_dict['missingness_drop_list']
fixed acidity missingness = 0.0
volatile acidity missingness = 0.0
citric acid missingness = 0.0
residual sugar missingness = 0.0
chlorides missingness = 0.0
free sulfur dioxide missingness = 0.0
total sulfur dioxide missingness = 0.0
density missingness = 0.0
pH missingness = 0.0
sulphates missingness = 0.0
alcohol missingness = 0.0

missingness_drop_list:
[]
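
The source of `sml_utils.get_missingness` is not shown in this notebook. As a hypothetical stand-in, the sketch below computes the fraction of missing values per attribute and lists the attributes above a threshold. The function name `missingness_report` and the strict `>` comparison are assumptions; the real helper may differ.

```python
import numpy as np
import pandas as pd


def missingness_report(df: pd.DataFrame, threshold: float) -> dict:
    """Return per-attribute missing fractions and the attributes above the threshold."""
    missing_fraction = df.isna().mean()  # fraction of NaN values in each column
    drop_list = missing_fraction[missing_fraction > threshold].index.tolist()
    return {"missingness": missing_fraction, "missingness_drop_list": drop_list}


# Toy frame (not the wine data): column "b" is 50% missing.
toy_df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, 1.0, np.nan, 2.0],
})
report = missingness_report(toy_df, threshold=0.20)

print(report["missingness"])
print(report["missingness_drop_list"])
```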

6. Exclude non-ML Attributes¶

In [25]:
train_cap_x_df.columns
Out[25]:
Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol'],
      dtype='object')
In [26]:
non_ml_attr_list = [] # no non-machine learning attributes were identified

7. Remove Unwanted Attributes¶

In [27]:
train_cap_x_df.columns
Out[27]:
Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol'],
      dtype='object')
In [28]:
ml_attr_drop_list = []

8. ML Attribute Configuration¶

In [29]:
ml_ignore_list = missingness_drop_list + non_ml_attr_list + ml_attr_drop_list
ml_ignore_list
Out[29]:
[]
In [30]:
# All of the attributes in this data set are continuous and numerical
numerical_attr = list(train_cap_x_df.columns)

# There were no nominal attributes found in this data set.
# They are all continuous numerical values from a series of physicochemical tests.
nominal_attr = []

assert train_cap_x_df.shape[1] == len(ml_ignore_list) + len(nominal_attr) + len(numerical_attr)

print(f'ml_ignore_list: {ml_ignore_list}')
print(f'\nnumerical_attr: {numerical_attr}')
print(f'nominal_attr: {nominal_attr}')
print(f'\nnumber of machine learning attributes: {len(numerical_attr) + len(nominal_attr)}')
print(f'\nnumerical_attr and nominal_attr: {numerical_attr + nominal_attr}')
ml_ignore_list: []

numerical_attr: ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol']
nominal_attr: []

number of machine learning attributes: 11

numerical_attr and nominal_attr: ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol']
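
Attribute lists like these are typically consumed by a preprocessing step in the modeling notebooks. A minimal sketch, assuming scikit-learn's `ColumnTransformer` with standard scaling for numerical attributes and one-hot encoding for nominal ones. The transformer choices and the `toy_` names are assumptions; the two toy columns reuse values from the `df.head()` output.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-ins for the attribute lists built above.
toy_numerical_attr = ["alcohol", "density"]
toy_nominal_attr = []  # empty, as in this data set

toy_df = pd.DataFrame({
    "alcohol": [8.8, 9.5, 10.1, 9.9],
    "density": [1.0010, 0.9940, 0.9951, 0.9956],
})

# An empty column list is allowed; that transformer is simply skipped.
preprocessor = ColumnTransformer(
    transformers=[
        ("numerical", StandardScaler(), toy_numerical_attr),
        ("nominal", OneHotEncoder(handle_unknown="ignore"), toy_nominal_attr),
    ],
    remainder="drop",
)
transformed = preprocessor.fit_transform(toy_df)

print(transformed.shape)
print(transformed.mean(axis=0))
```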

Script Runtime¶

In [31]:
end = time.time()
runtime = (end - start) / 60
print(f'Script runtime: {runtime:.4f} minutes')
Script runtime: 0.4024 minutes